Written Report

Author

Aaron Boone, Deven Singh, Miguel Serrano, Daniel Magelky

Project Proposal Data

Code
library(tidyverse)
library(here)
library(gganimate)

Data Set Descriptions

Both of our data sets have been sourced from the GapMinder website.

For our first data set we chose one that focuses on the GDP ( Gross Domestic Product) per capita of almost 200 countries across the world. GDP per capita takes the whole economic output of a country, and divides it by the country’s population to understand the economic output per person in a country. Lower GDPs per capita means the economy of the given country is poor. The data we have chose highlights the change in GDP per capita from 1950 to 2025 throughout the given countries.

Our second data is centered around Child Mortality rates in countries across the world. The rate of child mortality is measured as the amount of deaths to children 5 years or younger for every 1,000 children born in a country. This data set includes almost 200 countries from around the world and gives us their child mortality rates from 1950 - 2025.

Code
gdp_data <- read_csv(here::here("gdp_pcap.csv"))
child_mortality_data <- read_csv(here::here("child_mortality_0_5_year_olds_dying_per_1000_born.csv"))
Code
child_mortality_long <- child_mortality_data |>
  select(country, `1950`:`2025`) |>
  pivot_longer(cols = `1950`:`2025`, names_to = "Year", values_to = "Death_per_1000") |>
  filter(!is.na(Death_per_1000)) 
Code
gdp_data_long <- gdp_data |>
  select(country, `1950`:`2025`) |>
  pivot_longer(cols = `1950`:`2025`, names_to = "Year", values_to = "GDP") |>
  filter(!is.na(GDP)) |>
  mutate(GDP = if_else(str_detect(GDP, "k"), as.numeric(str_replace(GDP, "k", "")) * 1000, as.numeric(GDP)))

Hypothesis

We are combining our two data sets, GDP per capita and Child Mortality Rates, to examine what type of relationship, if any, there is between the two variables. We are testing how GDP per capita has affected Child Mortality Rates in countries across the world from 1950 to 2025, and if we can find any association between the two. Our hypothesis is that a lower GDP per Capita will be positively associated with a higher Child Mortality Rate. We believe this to be true because the lower economic standing a country has, the harsher living conditions there tends to be meaning an increased possibility of child mortality.

Code
Total_data <- child_mortality_long |>
  inner_join(gdp_data_long, join_by("country", "Year")) |>
  rename("Country" = country)

2 Linear Regression

<<<<<<< HEAD

The statistical method we’re using for this is linear regression. Linear regression is a statistical method that predicts the linear relationship between a quantitative response variable and one or more quantitiative explanatory variables. Specifically in this report, we are using a simple linear regression model, which takes in one explanatory variable, that predicts the linear relationship between our explanatory variable, GDP, and the response variable, Child Mortality rates.

=======

The statistical method we’re using for this is linear regression. Linear regression is a statistical method that predicts the linear relationship between a quantitative response variable and one or more quantitative explanatory variables. In this report, we are using a simple linear regression model which takes in one explanatory variable, that predicts the linear relationship between our explanatory variable, GDP, and the response variable, Child Mortality rates.

>>>>>>> 3b7ab264dbf0d0ff47edef107ab3252154d84e15
Code
avg_Total_data <- Total_data|>
  group_by(Country)|>
  summarize(avg_GDP = mean(GDP), avg_Death_per_1000 = mean(Death_per_1000))

2.1 Data Visualization

From the animate plots we are able to see the relationship between Time and both GDP and Child Mortality. Throughout the 75 year period we chose, 1950 to 2025, we can see generally that GDP has increased exponentially over time and Child Mortality has decreased significantly. We can see there is an inverse association between the two variables as when GDP goes up, Child Mortality goes down. Through the animation we can see dots moving in a diagonal direction indicating the possibility of a linear relationship between the two variables.

<<<<<<< HEAD
bubble <- ggplot(Total_data, aes(x = GDP, y = Death_per_1000, size = GDP)) +
  geom_point() +
  theme_bw() +
  labs(title = 'Year: 1950-2025', x = 'GDP', y = NULL, subtitle = "Child Mortality") +
  transition_time(as.numeric(Year)) +
  ease_aes('linear') 
  gganimate::transition_states(
  Year, 
  transition_length = 1, 
  state_length = 1
) 
=======
Code
bubble <- ggplot(Total_data, aes(x = GDP, y = Death_per_1000, size = GDP)) +
  geom_point() +
  theme_bw() +
  labs(title = 'Year: 1950-2025', x = 'GDP', y = 'Child Mortality') +
  transition_time(as.numeric(Year)) +
  ease_aes('linear') 
  gganimate::transition_states(
  Year, 
  transition_length = 1, 
  state_length = 1
) 
>>>>>>> 3b7ab264dbf0d0ff47edef107ab3252154d84e15
<ggproto object: Class TransitionStates, Transition, gg>
    adjust_nframes: function
    expand_data: function
    expand_layer: function
    expand_panel: function
    finish_data: function
    get_all_row_vars: function
    get_frame_data: function
    get_frame_vars: function
    get_row_vars: function
    map_data: function
    mapping: (.*)
    params: list
    remap_frames: function
    require_late_tween: function
    setup_params: function
    setup_params2: function
    static_layers: function
    unmap_frames: function
    var_names: states
    super:  <ggproto object: Class TransitionStates, Transition, gg>
Code
animate(bubble, renderer = gifski_renderer())
<<<<<<< HEAD

=======

>>>>>>> 3b7ab264dbf0d0ff47edef107ab3252154d84e15
Code
ggplot(avg_Total_data, aes(x = avg_GDP, y = avg_Death_per_1000)) +
  geom_point() +
  labs(x = "Average GDP", y = NULL, subtitle = "Average Child Mortality per 1000 Births", title = "Relationship between Average Child Mortality and Average GDP over time")

2.2 Linear Regression

\[\hat{y} = 123 - .0024x, \text{where}\] \[\hat{y} = \text{Predicted Average Child Mortality Rate} \text{ and } x = \text{Average GDP}\]

From this equation we can gather that when the Average GDP of a country is $0, the predicted Average Child Mortality Rate is 123 deaths per 1000 children. We can also see that with every one dollar increase in Average GDP, the predicted Average Child Mortality rate goes down by .0024 deaths per 1000 children.

Code
linear_regression_model <- lm(avg_Death_per_1000 ~ avg_GDP, data = avg_Total_data)
summary(linear_regression_model)

Call:
lm(formula = avg_Death_per_1000 ~ avg_GDP, data = avg_Total_data)

Residuals:
   Min     1Q Median     3Q    Max 
-85.55 -44.28 -18.56  37.85 149.78 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.230e+02  5.071e+00  24.262   <2e-16 ***
avg_GDP     -2.372e-03  2.405e-04  -9.866   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53.6 on 193 degrees of freedom
Multiple R-squared:  0.3353,    Adjusted R-squared:  0.3318 
F-statistic: 97.34 on 1 and 193 DF,  p-value: < 2.2e-16

2.3 Model Fit

From the linear regression model above, we can estimate that about 33.53% of the variability in the response values are explained by the regression model. This suggests that we have a weak to moderate model for explaining the variability in the data. We can see that the fitted values, which represent the variability of Average Child Mortality rates accounted for by the Average GDP, makes up only 1441.605 of 4299.856 of the response values, which is about .3353, the same as the R-Square value. The residuals, 2858.251 out of the 4299.856 response values, is the rest of the unexplained variability in the data set that isn’t represented in the model.

Code
variance_response <- var(avg_Total_data$avg_Death_per_1000)
variance_fitted <- var(fitted(linear_regression_model))
variance_residuals <- var(residuals(linear_regression_model))

formatted_table <- data.frame(Variance = c(variance_response, variance_fitted, variance_residuals), 
                              Source = c("Response Values", "Fitted Values", "Residuals"))

print(formatted_table)
  Variance          Source
1 4299.856 Response Values
2 1441.605   Fitted Values
3 2858.251       Residuals

3 Simulation

3.1 Visualizing Simulations from the Model

Code
predictions <- predict(linear_regression_model, avg_Total_data)
simulated_values <- predictions + rnorm(length(predictions), 0, sigma(linear_regression_model))

simulated_formatted_table <- data.frame(avg_GDP = avg_Total_data$avg_GDP, avg_Death_per_1000 = simulated_values)

ggplot() +
  geom_point(data = avg_Total_data, aes(x = avg_GDP, y = avg_Death_per_1000), color = "red", alpha = 0.5) + 
  geom_point(data = simulated_formatted_table, aes(x = avg_GDP, y = avg_Death_per_1000), color = "blue", alpha = 0.5) +
  labs(x= "Average GDP", y = NULL, subtitle ="Average Child Mortality per 1000", title = "Comparison of Observed and Simulated Data")
<<<<<<< HEAD

=======

>>>>>>> 3b7ab264dbf0d0ff47edef107ab3252154d84e15

3.2 Generating Multiple Predictive Checks

Code
set.seed(9531)
num_simulations <- 1000
r_squared_values <- numeric(num_simulations)

for (i in 1:num_simulations) {
  simulated_values <- predictions + rnorm(length(predictions), 0, sigma(linear_regression_model))
  simulated_formatted_table <- data.frame(avg_GDP = avg_Total_data$avg_GDP, avg_Death_per_1000 = simulated_values)
  simulated_linear_regression_model <- lm(simulated_formatted_table[,1] ~ simulated_formatted_table[,2], data = formatted_table)
  r_squared_values[i] <- summary(simulated_linear_regression_model)$r.squared
}

ggplot(mapping = aes(x = r_squared_values)
       ) +
  geom_histogram(binwidth = .02, color = "black", fill = "blue") +
  labs(x = "R-Squared",
       y = NULL,
       subtitle = "Frequency",
       title = "R-Squared Distribution"
       )